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Abstract 



This work presents a highly-optimised computational framework for the Discrete Dipole 
Approximation, a numerical method for calculating the optical properties associated with a 
target of arbitrary geometry that is widely used in atmospheric, astrophysical and industrial 
simulations. Core optimisations include the bit-fielding of integer data and iterative methods 
that complement a new Discrete Fourier Transform (DFT) kernel, which efficiently calculates the 
matrix-vector products required by these iterative solution schemes. The new kernel performs 
the requisite 3D DFTs as ensembles of ID transforms, and by doing so, is able to reduce the 
number of constituent ID transforms by 60% and the memory by over 80%. The optimisations 
also facilitate the use of parallel techniques to further enhance the performance. Complete 
OpenMP-based shared-memory and MPI-based distributed-memory implementations have been 
created to take full advantage of the various architectures. Several benchmarks of the new 
framework indicate extremely favourable performance and scalability. 

Keywords: Discrete Dipole Approximation, optimisation, matrix-vector product, CG-FFT, parallel 
algorithm 
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1 Introduction 

The Discrete Dipole Approximation (DDA) is an intuitive and extremely flexible numerical method 
for calculating the optical properties associated with a target of arbitrary geometry, whose dimen- 
sions are comparable to or moderately larger than the incident wavelength. The basic concept was 



first introduced in 1964 to study the optical properties of molecular aggregates (jPeVoe. H. . I1964L 



19651 ) . Subsequently, prompted by the polarisation of light by interstellar dust and the desire to calcu- 



late the approximate extinction, absorption and scattering cross se ctions for wavelength-sized dielec- 

trie g rains of arbitrary shape, a comparable scheme was formulated (jPurcell. E. M. and Pennvmaker. C. R. 
19731 ). Since then, the DD A has undergone major developmen t. A recent erudite review describes 



this development in detail (jYurkin. M. A. and Hoekstra. A. G.1 . 12007^ 



There are currently two main publicly available DDA implementations, DDSCAT'^ and ADDA^, 
the develo pment of which has provided s e minal contributions to the advancement of the DDA. 
DDSCAT iDraine. B. T. and Flatau. P. J.I . |2004[) is written in FORTRAN 77 and is, at present, 
the most widely used DDA implementation. It has been developed by Bruce Draine from the 
Princeton University Observatory and Piotr Flatau from the Scripps Institution of Oceanography, 
University of California, San Diego. Its development provided several advancements including, 
among others, an updated expression f or the dipole polarisability that incorporated a radiative 
reaction correction dDraine. B. T\ . Il988l) . the application of DFT techniques to the iterative core 



([Goodman. J. J. et al.l . Il991l ). and the intro duction of the Lattice Dispersion Relat ions (LDR) for 



the prescription of the dipole polarisability ( Draine. B. T. and Goodman. J. . 19931). A review pa 



per by t he authors concisely summa rises these contributions (jDraine. B. T. and Flatau. P. J.l . ll994[ ). 
ADDA dYurkin. M. A. et al.l . l2007f ) is an extremely advanced C code that has been developed (over 



a period of more than 10 years) by Maxim Yurkin and Alfons Hoekstra at the University of Amster- 
dam. It was the first parallel implementation of DDA and from its initial incarnat ion has been able 



to perform sin gle DDA computations acro ss cluster s of autonomous compute rs (jHoekstra. A. G 



[1994; Hoekst ra. A. G. and Sloot. P. M. aI [1995 : Hoekstra. A. G. et al.[ . [l998[ ). This enables the 



code to overcome the restrictions on the number of dipoles that can be used due to the memory 
limitations of a single computer. The code also incorporates several integrated iterative solvers, 
seve ral prescriptions for prescribing the dipo le polarisabilities, and a host of other functional- 
ity ([Yurkin. M. A. and Hoekstra. A. G.| . [20081) . Other DDA implementations, w hich are no t pub- 



licly available, see (I Yurkin 



Lumme. K. and Rahola. J 



M. A. et all 



for further details, include SIRRI ( Rahola. J. . 1996t 



1998[): a FORTRA N 90 code that has been developed by SimuHntu Ltd 



in Finland, and ZDD ([Zubko. E. et al.l . l2003l ): a C++ code that has been developed by Evgenij 



Zubko and Yuriy Shkuratov from the Institute of Astronomy, Kharkov National University in the 
Ukraine. For interested parties, a recent paper performs an in-depth comparison of these four codes 



( Penttila. A. et all . 1200 



Put succinctly, the DDA is an extremely flexible and powerful numerical scheme which replaces 
a continuum scatterer by an ensemble of Nd dipoles situated on a cubic lattice. Each of the Nd 
dipoles is driven by the incident electric field and by contributions from the Nd — 1 other dipoles. 
Eq.[T] outlines the kernel of this scheme which is a very large and fully populated set of 3Nd complex 
linear equations. 



^0 — 



a,. 



.Nd 



(1) 



^http: / /www. astro. princeton.edu/~drainc 
http: / / www. science, uva. nl / research /scs/ Software /adda / index, html 
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Eq.[T]must be solved for the unknown polarisatfons, Pj, from which the scattering properties can 
be calculated. Here, k is the wavcnumbcr, eo is the permittivity of free space, is the polarisability 
of dipolc i, Ei„c,i is the incident electric field at dipole i, I is the identity matrix, Vij = — r^, 
rij — \rij\ and g) is the dyadic product, which is a special case of the Kronecker product for 
vectors of equal order. 



"TyTx ry^ ryr^ (2) 
rzrx r^Ty r^} j 

For systems based on three-dimensional lattices, the computational requirements are extremely 
sensitive to the size of the lattice used. Even a small increase in the dimensions of the lattice signif- 
icantly increases the computational overheads. As a result, the true potential of the host algorithm 
is often encumbered by the severe and often prohibitive burdens imposed by the computational scale 
required to produce meaningful results. In these cases, it is crucial that every possible algorithmic 
or computational enhancement be exploited in order to augment their applicability. OpenDDA is 
a novel computational framework for the DDA which has been custom-built from scratch in the 
C language. The new framework incorporates a variety of algorithmic and computational optimi- 
sations, enabling researchers to take full advantage of the different architectures that exist in the 
current generation of computational resources and beyond. Apart from the core optimisations, the 
new framework also incorporates an extremely intuitive and user-friendly interface for specifying 
the simulation parameters, including a sturdy and forgiving parser and a straightforward human- 
readable input control file. 

The paper is organised as follows. Section [2] discusses the application of bit-fielding to streamline 
the storage of the target occupation and composition data. Section [3] deals with the design and 
development of a new highly-optimised DFT kernel to calculate the matrix-vector products within 
the iterative solution schemes. Section H] discusses the custom-built iterative schemes that have 
been included to support the new DFT kernel. Section [5] describes the extension of this formulation 
to both shared-memory and distributed-memory parallel environments. Section [6] provides various 
various serial and parallel benchmarks to corroborate the assertions and illustrate the implications 
of the new optimisations. Finally, Section [7] presents concluding remarks. 



2 Bit-fielding of target occupation and composition data 

In many applications, the range provided by the standard integer data types far exceeds the maxi- 
mum range associated with the task at hand, and as a result, for large datasets, most of the bits are 
superfluous, constituting a serious waste of resources. Bit-fielding is a way to overcome this issue 
by storing data within the bits rather than the integers themselves. The entries can be set & tested 
by taking the appropriately shifted bit-mask and applying the relevant bitwise operator, LOGICAL 
OR for setting & LOGICAL AND for testing. The simplicity and speed of bitwise operations makes 
bit-fielding an extremely efficient method of storing non-negative integer based data. 

The DDA's theoretical strength, which is also its computational weakness due to imposed com- 
putational overheads, is its volumetric nature. Its ability to treat arbitrarily shaped particles with 
complex internal structures derives from the fact that each constituent site in the computational 
domain has separate integer flags describing its properties. Rather than storing the spatial extent 
of the target as the Cartesian coordinates of each lattice site in the computational domain and a 
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Boolean occupation flag to indicate whether or not the site is occupied, requiring four integers, 
OpenDDA simply stores a single bit-fielded occupation flag per lattice site. The Cartesian coordi- 
nates need not be stored explicitly as, if required, are readily available as the loop control variables 
or via direct extraction from the array index itself. The optimisation of this integer data storage 
reduces the memory required for the spatial description of the target by a factor of 128. 

The DDA also requires the specification of the material properties of the target on a dipole-by- 
dipole basis. This is done by associating each complex refractive index value with a distinct integer 
flag, and for each dipole, storing three integers to define the material properties in the x, y & z 
directions. To compress this composition data, OpenDDA employs an extension of the bit-fielding 
theory to non-Boolean data sets. The new framework automatically calculates the requisite bit- 
mask, i.e., for four disparate materials, — > 3, two bits are all that is required, and the bit-mask 
is Sdec = llbin- For this example, the memory required for the integer data associated with the 
compositional description of the target is reduced by a factor of 16. 

To put the optimisations in perspective, for K = J = P = 500, N = KxJxP = 125 x 10^, where 
K , J &^ P are the x, y & g dimensions of the computational domain, respectively, with four mate- 
rials, the memory requirements for the integer data associated with the spatial and compositional 
description of the target are reduced from 3.26GB to just over 0.1GB. Furthermore, the compu- 
tational overhead incurred by the application of this bit-fielding optimisation within OpenDDA is 
negligible. Firstly. OpenDDA does not require the actual Cartesian coordinates within the iterative 
core. They are only needed at the initialisation stage for the calculation of the incident electric field 
at each of the lattice sites. Secondly, due to the nature of the iterative methods and the fact that 
the iterative solution vectors only include entries corresponding to occupied sites (discussed in 
the Boolean occupation flags are only tested at the initialisation stage, during the calculation of the 
dipole polarisabilities, and once at the start and end of the DFT matrix-vector product routine. Al- 
though the integer data constitutes only a fraction of the total memory footprint, with most deriving 
from the requisite complex vectors, these optimisations are not without merit and fundamentally 
underpin the efficiency of the current distributed-memory implementation of OpenDDA. To facil- 
itate the elimination of zero storage within all iterative solution vectors, each autonomous node 
requires local knowledge of the entire spatial and compositional specification of the target and thus, 
unlike the complex vectors within the iterative core, the target specification is not distributed. Bit- 
fielding allows the full description of the target to be known globally without a prohibitive overhead. 
With each node having typically a few gigabytes to work with, the original memory requirements 
associated with the description of the target would have been comparable to the maximum possible 
allocation per node, negating the possibility of a viable distributed implementation. 



3 Kernel optimisation 



Due to the sheer size of the system produced by Eq. [T] iterative 0{h[g]'^) methods, where g = 3Nd 
and h is the number of iterations required to reach convergence, are employed, which are much faster 
than direct 0{g^) methods for large Nd and h <^ Nd- The cardinal computational hurdle and key to 



the acceleration of these iterative methods , such as the conjugate gradient method ([Barrett. R. et al 



1994 : Golub. G. H. and Van Loan. C. F. . 19961 ). are the matrix- vector products required by these 



schemes. A major breakthrough in this respect was the realisation that by embedding the arbitrarily 
shaped target in aA^=: A' xJxP computational domain and neglecting the diagonal terms, the 



matrix- vector products can be reformulated as convolutions (jGoodman. J. J. et al.l . ll99lh . Once the 



convolutions have been performed, it is then a trivial matter to include the contributions from the 
previously neglected terms. 



T he solution method is predicated upon To eplitz and circulant matrix structure (jDavis. P. J 



1994 iGolub. G. H. and Van Loan. C. F.I 



f on 10 



A Toeplitz matrix of order [n], denoted '"'T, is 
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simply a matrix with constant diagonals, whereas a circulant matrix, '"'C, is a special case of 
Toeplitz structure, where the last element of each row is the same as first element of the next row. 
Any Toeplitz matrix, '"'T, can be converted to its circulant counterpart, 'C, where n' = 2n — 1, by 
taking the first column, appending the entries of the first row, excluding the initial entry, in reverse 
order, and completing the diagonal s. The key is that any circulant matrix is diagonalised by the 



corresponding Fourier matrix, '"''F (jVan Loan. C. F.l . ll992f ). where '"''c is the first column of '"''C. 

f"'lF = -^u:P\ p, g = 0, . . . , n' - 1, = e-^ 

["'1F["'1C[-'1fH ^ diag [1"''F] 1"'1c) = diag (Ac) (3) 

Using the fact that the Fourier matrix is unitary, 'F^^ ~ ["'ipH^ -j-j^^ multiplication of an 
arbitrary vector with a circulant matrix can be performed as an element-wise multiplication in the 
Fourier Domain (FD) via the Convolution Theorem, / ^ F, g ^ G, f * g ^ F.G. The symbol 
denotes convolution, denotes a Fourier transform pair and the symbol '.' implies element-wise 
multiplication. 

["'ly = ["'iC'"'ix = t"'lF-ld^ag (Ac) i"'iF'"''x = i"''F^i[Ac].['"''F'"''x] (4) 

Not only docs this reduce the computational complexity of the matrix- vector product from O(n^) 
to 0{n' Inn'), but since the circulant eigenvalues are a scalar multiple of the DFT of the first column 
of the circulant matrix, sec Eq. [3l the memory is reduced from O(n^) to O(n'). To expedite the 
calculation of a Toeplitz matrix-vector product, '"'y = ["iT'"'x, the matrix must undergo Toeplitz- 
to-circulant (T-to-c) conversion, '"'T — > '"''C, and the vector zero-padded to the dimension of the 
circulant matrix by appending n — 1 zeroes. Eq. Ulcan then be applied. The first n elements of the 
resultant are the required answer for the original Toeplitz matrix- vector product, with the remaining 
71 — 1 entries being redundant. Please note that this methodology and the following discussion 
is only applicable to systems exhibiting Toeplitz structure. Non- Toeplitz structure negates the 
transformation to circulant structure and hence, the use of DFTs to calculate the matrix-vector 
products efficiently In terms of the specific geometry of the target in question, the method is 
indifferent. For sparse systems, from targets with a low volume filling fraction (VFF), such as 
dendritic or highly porous structures, the only difference is the amount of memory required by 
the iterative solution vectors. The solution vectors within the iterative core do not store entries 
corresponding to vacant lattice sites, a fact that will be discussed in 21 However, the DFT kernel 
that efficiently calculates the matrix-vector products required by the iterative solvers must account 
for the entire rectangular computational domain, including the vacant sites, and thus performs the 
same amount of calculations, irrespective of the VFF. 

For the DDA, by embedding the arbitrary target in a rectangular lattice and neglecting the diag- 
onal terms, the requisite matrix-vector products take on the form of a complex-symmetric level-three 
tensor- Toeplitz matrix of order [P] [J] [K], denoted i^n-'KJfi'p^ which, the constituent tensor elements 
are complex-symmetric but not necessarily Toeplitz, multiplied by an arbitrary vector of length N, 
denoted '"'x, whose constituent elements are Cartesian vectors. Computationally, l^HJKJfi'j Qa,n be 
considered as an assemblage of nine complex-symmetric level-three single-element Toeplitz matrices, 
and since the tensor elements are symmetric, only six of the nine tensor components are indepen- 
dent and need to be considered explicitly, xx, xy, xz, yy, yz, zz =i^'H./imTO, i^K'^n^iTl, . . . , i^n-nmr^^^ 
Analogously, the arbitrary vector can be split into three separate x, y, z component vectors, '"'xO, 
■"'xl, '"'x2. Consequently, this means that the components can be treated separately and the 
full problem is reduced to an ensemble of smaller single-element problems whose contributions will 
ultimately be manifested through FD products. 
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At this point, before attempting to decipher the ensuing optimisations of this DFT technique, for 
readers who are not famihar with the specific algorithmic details associated with the application of 
DFTs to the solution of such matrix- vector systems, it would be prudent to read Appendix [X] This 
concisely summarises the conventional application of DFTs to these complex-symmetric level-thr ee 
tensor- Toeplitz systems in the context of the DDA, as prescribed bv lGoodman. J. J. et al.l ( 1991 ). 

The complete standard algorithm, which encapsulates the conventional implementation, can be 
summarised as follows: 



1. T-to-c conversion and 3D DFT of the six independent tensor components 

2. Zero pad and 3D DFT of the three zcro-paddcd vector components 

3. FD element- wise multiplication of the tensor and vector components 

4. 3D iDFT and normalisation of the components of the resultant 

5. Addition of the previously ignored diagonal contributions 

Due to the T-to-c conversions that are required for DFT applicability, the six tensor-components 
arrays contain a high degree of symmetry. Since a. K' x J' x P' 3D DFT can be carried out as 
an ensemble of ID DFTs, J' x P' in the x, K' x P' in the y, and K' x J' in the z direction, a 
new algorithm which exploits these symmetries can be constructed. Rather than expanding the six 
tensor-component arrays from a K x J x P Toeplitz structure to a i^' x J' x P' circulant structure 
and applying 3D DFTs, as delineated in Fig. O the T-to-c conversion and subsequent DFT can be 
amalgamated into a single computationally efficient operation. Each of the lines in the x direction, 
^"hOsDj^kJ = 0, J - l,fc = 0, ...,P- 1, in the y direction, i'it03Di,fc,i = 0, . . . , if - 1, fc = 
0, . . . ,P — 1, and in the z direction, '^'tOsDi^, i = 0, . . . , /\ — 1, j = 0, . . . , J — 1, in section '1' 
in Fig. can be taken in turn and copied into an external ID array of length K' = 2K — 1, 
J' = 2J — 1, and P' = 2P — 1, respectively. Each can be treated as the first column of a level-one 
Toeplitz matrix and the T-to-c conversion performed. The DFTs of these external vectors can be 
computed, and subsequently, the first K, J and P elements of the resultants can be copied back 
into the 3D tensor-component array, overwriting the original data. This significantly reduces the 
number of ID DFTs that have to be performed. Taking into account the zero-padding to ensure 
that K" = K' + kf = 2K -1 + kf, J" = J' +jf = 2J - 1+jf, and P" = P' + pf = 2P - 1 + pf 
permits the use of fast DFT algorithms, the number of transforms is reduced by approximately 75% 
in total: 



X direction, see Fig. [IK a): 6 [J" x P"] 
y direction, see Fig. [l](b): 6 [K" x P"] 
z direction, see Fig. [l](c): 6 [K" x J"] 



24[JxP] 6 [J X P] 



24 [K X P] 



24 [K X J] 



6[{K + kf) X P]^ 6[K X P] 



6 [{K + kf) X ( J + j/)]~ 6 [if X J] 



Furthermore, since the full T-to-c conversion, as delineated in Fig.jHl is never explicitly performed, 
only section T' in Fig. [Hlnced be stored, reducing the complex number storage required for the six 
tensor-component arrays by approximately 



6 [K" X J" X P"] ~ 48 [if X J X P] ^6 [{K + kf) x ( J + jf) x {P + pf)] ~ 6[K x J x P] 



For the three zero-padded vector-component arrays, the realisation that sa | of each is by very 
definition, identically zero, c an be used to expedite the calculations . This was originally proposed by 
Hoekstra. A. G. et all (|l998h and has been exploited within ADDA (jYurkin. M. A. and Hoekstra. A. G 
20081 ). Within OpenDDA. for computational efficiency, the DFTs of the three vector components. 



the subsequent element-wise FD multiplication of the tensor and vector components, and the iDFTs 
of the resultant, arc amalgamated into a conjoint operation. Unlike the tensor component DFTs, 
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which are performed m the x, y & z dhections, the DFTs of the vector components are performed 
in the y, x & z directions, and the iDFTs are performed in the z, x & y directions. The reason for 
this is twofold. Firstly, it allows the use of a x^-scratch plane for the DFTs in the x & z directions, 
the FD multiplication, and the iDFTs in the z & x directions, which provides significant memory 
reductions, as only « i of each of the three full K" x J" x P" arrays need be stored. Secondly, it 
supports the distributed FD multiplication in the MPI version of the algorithm (discussed in Q . As 
a result of this optimisation, the complex number storage for the three vector components, ignoring 
the storage required for the three xz-scratch planes, is reduced by approximately 75%: 



3 [K" X J" X P"] ~ 24 [if X J X P] -> 3 [if X J" X P] ~ Q[K x J x P] 



Each of the three vector components is embedded into a zero-padded [K x J" x P] array, K 
in the x direction, J" in the y direction, and P in the z direction. Once embedded, the DFT 
is performed in the y direction. Then, each of the [K x P\ xz-planes is, in turn, copied into a 
[K" X P"] external zero-padded scratch sz-plane. Subsequently, the DFTs in the x & z directions, 
the element-wise FD multiplication of the tensor & vector components, and the iDFTs in the z & 
X directions are performed 'on-the-fly' for this xz-plane. The initial [K x P] section is then copied 
back into the 3D vector component array, overwriting the original data. Finally, once this has been 
done for each of the J" xz-planes, the iDFT in the y direction is performed. 

The reason that this is all possible is due to the fact that only non-zero DFTs, and only iDFTs 
that will contribute to the final answer, need to be performed. As far as the DFTs are concerned, 
in the y direction, only the lines '"'''xOsDi^fc, i — 0, . . . , if — 1, fc — 0, . . . , P — 1, see Fig. [T] (d), 
contain any non-zero values and need be considered. Subsequently, in the x direction, only the lines 
'xOsDj^fc, j = 0, . . . , J' — 1, A: = 0, . . . , P — 1, see Fig. [1] (e), need be transformed, and in the z 
direction, all the lines 'xOsDi.j, « = 0, . . . , A'' — 1, j = 0, . . . , J' — 1, see Fig. [1] (f), contain non-zero 
values and must be included. Taking into account the zero-padding to ensure that K" = K' + kf ~ 
2K-1 + kf, J" = J' +jf = 2J -1+jf, and P" = P' + pf ^ 2P - I + pf permits the use of fast 
DFT algorithms, the number of transforms is reduced by approximately 42% in total: 



y direction, see Fig. [6](d): 3 [K" x P"] 
X direction, see Fig. [6](e): 3 [J" x P"] 



12 [K X P] 



3 [a: X P] 



12 [J X P] 



z direction, see Fig. [6](f): 'no reduction possible^ 



3 [J" X P] ~ 6 [J X P] 



The FD element-wise multiplications can be carried out just as detailed in Eqs.El^IHlin Appendix 
El with one minor caveat. The six independent tensor-component arrays never underwent the full T- 
to-c conversion, however, since the DFT preserves the mirror symmetries, when accessing YO Y5, 
the pertinent data can be extracted using the criteria: 

. _ f I, if i < AT; . f j, if j < J; l. _ ( if fc < P; 

^~ \ k' ^i, if i > a:. ^ ^ \ J' - j, if j>J. \P' - k, if k > P. 

i = 0, . . . , a:' - 1 j = 0, . . . , J' - 1 fc = 0, . . . , p' - 1 

For the iDFT of the resultant of the FD multiphcation, i.e., of QO, Ql & Q2, see Eqs. H] — > 
|S] in Appendix El only transforms that will contribute to the final requisite answer need to be 
performed. In the z direction, all the lines 'QO,; j,i = 0, . . . , AT' — 1, j = 0, . . . , J' — 1, see Fig. |T] 
(f), contribute, through the subsequent x and y transforms, to the final answer. Subsequently, 
in the x direction, only the lines '^''QO^^.,.? = 0, . . . , J' — 1, A: = 0, ...,P — 1, see Fig. [T] (e), 
contribute, through the subsequent y transform, to the final answer, and finally, in the y direction. 
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only the lines ''''QOj 



0, . . . ,K — l,k = 0, . . . ,P — 1, see Fig. [T] (d), contribute and need to be 



considered. Taking into account the zero-padding to ensure that K" = K' + kf = 2K — 1 + kf, 
J" = J'+jf = 2J-l+j/, and P" = P'+pf = 2P-l+pf permits the use of fast iDFT algorithms, 
the number of transforms is reduced, as with the forward DFTs, by approximately 42% in total: 



z direction, see Fig. [6](f): 'no reduction possible' 
X direction, see Fig. [6](e): 3 [J" x P"] 
y direction, see Fig. [6](d): 3 [K" x P"] 



12 [J X P] ^3 [J" X P] 



12 [K X P] 



3[Kx P] 



6 f J X PI 



The approximate complexity improvements for both computation and storage are given in Table. [2] 
Please note that for all three architectural variants, serial. OpcnMP (shared-memory), and MPI 
(distributed- memory), the DFT functionality required by the iterative core is provided by the ad- 
vanced com plex interface of the extremely efficie nt, highly-portable and well known FFTW package 
(version 3) (|Frigo. M. and Johnson. S. G.I . I2005I ). 



4 Iterative schemes 



To avoid unnecessary changes in data format to comply with the requisite structure of some external 
package, which would only serve to subvert any and all attempts to maximise the performance and 
streamline the structure of the framework, (like similar DDA implementations) OpenDDA incor- 
porates a relatively large selection of iterative schemes that have been specifically custom-built to 
integrate seamlessly with the new DFT matrix-vector kernel and support the custom-built domain 
decomposition and parallel memory allocation schemes in the MPI implementation. Although a 



variety of different iterative solve r s have been employ e d bvlDraine. B. T. and Flatau. P. J.I (Il994r): 
Lumme. K. and Rahola. J.I (ll994l):[Fiatau. P. J.I (fl997i):lNebeker. B. M. et all (ll998l) :lRahola J.I (11998): 



Fan. Z. H. et all l|2006f ): IPenttila. A. et all (|2007l ): lYurkin. M. A. et all (|2007l ). to date, no defini- 
tive "best" algorithm has been identified. To permit further investigation and to maximise the 

Conjugate Gradients (CG) 
Conjugate Gradie nts Squared (CGS) 



(Hestenes. M. R. and Stiefel. E.. 1952: 


Shewchuk. J. R.I. 


(Sonneveld. P.. 19891: Barrett. R. et al. 


. 19941). BiConiu 


Barrett. R. et al.. 19941). BiCG for symmetric systems 



1991. 



Fletcher. R.I.I197. 



BiCGgvm) (|Freund. R.1 . 119921 ). Stabilised 



the BiCG (RBiCGSTAB) (jSleiipen. G. L G. and Fokkema. D. r1 



Restarted, stabilised version of 
19931). Qu asi -minimal residual 



with coupled two-term recurrences (OMR) (|Freund. R. and Nachtigal. N.l . ll99ll . ll992tlBucker. H. M. and Sauren. M 



9961). OMR for sy mmetric systems (QMR^y^) (jPreund. R.l . ll992h . Transpose-free QMR (TFQMR) 
Freund. R.l.ll993l) and a variant of BiCGSTAB based on multiple Lanczos starting vectors (ML(n)BiCGSTAB) 
(|Yeung. M. and Chan. T.n . ll999l) . 



For computational efficiency, as in ADDA (jYurkin. M. A. et al.l . l2007l : lYurkin. M. A. and Hoekstra. A. G 



20081 ). the new framework only stores information corresponding to occupied lattice sites, i.e., the 
elimination of zero storage within all iterative solution vectors. For the majority of cases, this 
provides a significant reduction in the memory required to store these vectors, as the target being 
simulated does not fully utilise the computational domain, generally having fewer occupied sites 
towards the exterior. As a quick example, consider a spherical target based on an 18 x 18 x 18 
computational grid. Although the grid contains 5832 lattice sites, for a spherical target, only 3112 
are occupied, and dumping the unoccupied data, which is identically zero, equates to a decrease 
of ^ 47% in the memory required for the iterative vectors. To facilitate this optimisation, data 
extension and compression algorithms were created to extend a target vector out to allow the DFT 
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matrix- vector product, and subsequently, to compress the resultant back into the computationally 
efficient form with just the entries corresponding to occupied sites. 

5 Parallel implementations 

Over and above the serial improvements, due to the fact that the DFT kernel has been reduced from 
using 3D to ID transforms, parallel techniques can be used to further enhance the performance. Full 
OpcnMP-based and MPI-based implementations of the new framework have been created to take 
proper advantage of both shared- memory and distributed- memory architectures. 

For the shared-memory implementation, for thread safety, each thread must have its own DFT 
scratch arrays. Furthermore, since the aforementioned data-compression algorithm at the end of 
the DFT kernel operates in-place within the result buffer, to prevent pertinent data loss by multiple 
processes working within the same array, it has a maximum scalability of three, i.e., one thread per 
vector component. 

For the distributed-memory version, custom 'cyclic-plane' domain decomposition and parallel 
memory allocation schemes have been developed, which, apart from permitting the efficient loca- 
tion of, and access to, pertinent data, also provide a decrease in memory per node that is directly 
proportional to the number of available nodes and inherent 'to-the-nearest-plane' load balancing. 
This is an important point. For the majority of cases, targets do not have a constant number of 
occupied sites per plane. Since the custom-built iterative solution schemes only store information 
corresponding to occupied sites, 'cyclic-plane' decomposition provides a much improved data distri- 
bution across the available nodes. For example, take the previously discussed spherical target based 
on an 18x 18x 18 computational grid, in which only 3112 of the 5832 lattice sites are occupied. If 
this computational domain is 'block' decomposed across three nodes, then even for this simplistic 
case, the imbalance of ~ 47% in the data distribution dwarfs the negligible ~ 3% imbalance for the 
'cyclic-plane' decomposition, see Fig. [2] It is conceded that by 'block' decomposing the constituent 
planes non-uniformly among the three nodes, so that node '0' is assigned the first seven planes, node 
'1' is assigned the four middle 4 planes, and node '2' is assigned the remaining seven planes, the 
load imbalance is reduced to a mere 2%. However, this requires that the domain decomposition and 
parallel memory allocation algorithms be aware of, and be able to take account of, the user-defined 
and essentially arbitrary target geometry. In practice, this has not been the case. 

To facilitate the distributed DFTs and iDFT required by the new kernel, the new framework 
incorporates several in-place local and distributed transpose algorithms, which have all been custom- 
designed to support the 'cyclic-plane' decomposition. They are all capable of treating completely 
arbitrary systems decomposed across an arbitrary number of nodes. Please note that, unlike the 
serial and OpenMP versions, which only store section '1' in Fig. O for each of the six independent 
tensor components, the MPI version also stores an extra octant. Rather than storing a [K x J x P] 
array for each of the six components, the MPI version stores a [K x J" x P] array, K in the x 
direction, J" in the y direction, and P in the z direction. This is to facilitate the distributed 
element-wise FD multiplication, see Eqs. IH-^IS] in Appendix 1X1 that is required by the DFT kernel. 
This extra memory requirement is not an issue as the linear decrease in the memory per node, 
afforded by the parallel memory allocation, makes the extra storage inconsequential. 

6 Benchmarking 

On top of the architectural specific nature of the implementations, each also exists in float, double 
and long double precision. For all benchmarks, double precision was used and level three and inter- 
procedural optimisations were used for the compilations, i.e., '-03 -ipo' for the Intel compilers and 
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'-03 -ipa' for the Pathscalc compilers. Furthermore, to standardise the results, all benchmarks were 
run using the same set of arbitrarily chosen physical parameters, SPHERE, incident wavelength 
A = 3.175 /im, incident polarisation eo = x, effective radius (radius of a sphere of equal volume) 
a = 0.5 fim, and complex refractive index m = 1.63631 + 0.372i^. The following list details both 
the particular hardware and software configurations of each of the benchmarking systems. For each 
benchmark, the relevant architecture will be referenced simply by the associated tag, FRANKLIN, 
NEMO or WALTON. 

• FRANKLIN: Shared-memory, SGI Altix 3700, Red Hat Enterprise Linux AS release 3, lA- 
64, 24 X Intel Itanium 2, 1.5 GHz, 6 MB cache, 96 GB RAM, Intel Compilers 9.1, FFTW 3.1.2, 
Intel MKL 8.1.014. 

• NEMO: Shared-memory, HP xw9300 Workstation, openSUSE 10.2, x86_64, 2 x AMD Dual- 
Core Opteron 275, 2.2 GHz, 2 MB(2 x 1 MB) cache, 16 GB RAM, Intel Compilers 9.1, FFTW 
3.1.2, MPICH2 1.0. 5p4. 

• WALTON^: Distributed-memory, IBM cluster 1350, SUSE LINUX Enterprise Server 9, 
x86_64, 476 compute nodes, 952 x AMD Opteron Single-Core Opteron 250, 2.4 GHz, 1 MB 
cache, 412 have 4GB RAM per node, 64 have 8GB RAM per node, QLogic PathScale Compiler 
Suite, Version 3.0, FFTW 3.1.2, MPICH2 1.0.2. 

To affirm the benefits ascribed to the new DFT kernel, the constituent optimisations have been 
applied sequentially, on FRANKLIN, to show their respective contribution to the cumulative effect. 
Fig. [3{a) shows the time to compute the matrix-vector product for the original, various hybrid, and 
new algorithms in a regime that allows comparison with the explicit matrix-vector multiplication. 
Fig. [3|b) shows long-range timings, where the algorithms have been applied, within their aptitude, 
to much larger systems. For both the short-range and long-range cases, the timings for the 3D 
(Grig) and ID (Grig) algorithms are essentially identical. Since the optimisation of the kernel is 
predicated on the concept of removing superfluous transforms by performing the requisite 3D DFTs 
as ensembles of ID transforms, this congruency is crucial as it shows that there is no computational 
overhead associated with the migration. 

For the short-range case, even thou gh the explicit ma t rix-ve ctor multiplication was performed 
with the highly optimised Intel MKL ( Intel Corporation! . 12007 ). it is obvious that the sheer size 



of the systems involved negate its use almost immediately. Furthermore, for both plots, the se- 
quential application of the components of the new algorithm is clearly discernible. To put these 
improvements in perspective, in comparison to the original algorithm, the new algorithm requires 
approximately 60% less transforms in order to carry out the matrix- vector product, which has a 
comparably profound influence on the timings. 

In the long-range case, the comparison of the original and new algorithms is limited to the 
nethermost region of the plot. This is due to the prohibitive memory requirements associated with 
the original algorithm. Note that even for extremely large systems, the new algorithm is notably 
faster than the original algorithm is for a system that is just a fraction of the size. Both the 
short-range and long-range cases also include timings for the OpenMP-based kernel across 4, 8 & 
16 threads, which unequivocally show the substantial performance gains that can be obtained from 
shared-memory parallelisation. 

An ancillary computational benefit, as shown in Fig. ^c), is that the new formulation only 
requires the creation of a few ID DFT plans, rather than full 3D DFT plans. This is especially 



^ Complex refracti ve index of ice for A = 3.175^m from the REFICE routine, originally developed by Warren in 1984 
IWarren. S. G.I l|l984l ). and subsequently improved by Warren, Gao and Wiscombe in 1995 (Unpublished, source code 
available in the PyARTS package, a python package developed to compliment the Atmospheric Radiative Transfer 
Simulator - ARTS. Available from the Satellite Atmospheric Science Group, http: / / www.sat.l tu.se/arts/toolsJ . 

^ICHEC distributed cluster, Irish Centre for High-End Computing, http:/ /www.ichec.ie| 
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beneficial if using FFTW (|Frigo. M. and Jofinson. S. G.I . Il998l l2005h with FFTW.MEASURE to 
compute tlie transforms, which instructs FFTW to run and measure the execution time of several 
DFTs in order to find the best way to compute the transform. As a result, initialisation times 
become negligible. 

In comparison to the original algorithm, the new algorithm requires over 80% less memory to 
store the tensor and vector components. This is evident from the approximate memory requirement 
presented in Fig. [3]^d); where, for the maximum benchmarked size, corresponding to K = J = 
P = 610, although the original 3D DFT algorithm would require over 257GB. the new algorithm 
and its OpcnMP counterpart only require 52GB to perform the same calculation. This figure also 
incorporates the memory requirements for the distributed MPI-based implementation, showing the 
reduction in memory per node that is only limited by the number of nodes available. 

To ascertain the performance and scalability of the OpenMP-based shared-memory implementa- 
tion, a large-scale benchmark was performed across 1, 2, . . . , 20 threads on FRANKLIN, see Fig. [H 
The system size was chosen to be K = J = P = 600, which produced a computational domain with 
A^ = 216 xlO^ sites in total and a representation of the spherical target with A^cj^ 113, 098, 912 occu- 
pied sites. In terms of memory usage, this system required 49GB for 1 thread, with a linear increase 
of approximately 66MB per thread to approximately 50GB for 20 threads. The extra memory is 
for the thread private scratch spaces that arc required to ensure thread safety. Although this may 
sound like large amount of memory, in comparison, for the same system, the original algorithm, 
which is limited to the serial environment, would require approximately 242GB. The time for the 
matrix-vector product is reduced from 2576s for 1 thread to just 156s using 20 threads, a speedup of 
approximately 16.5. All components show excellent scalability, except of course, the aforementioned 
data compression algorithm, see 33 which is not an issue as it constitutes only a negligible fraction 
of the DFT kernel's runtime. 

To test the performance and scalability of the MPI-based distributed-mcmory implementation, 
the same K = J ~ P = 600 system as for the OpenMP shared-memory benchmark, was run on 
WALTON, see Fig. H) Here, due to the fact that each node has only 4GB of physical memory, the 
benchmark was initialised across 32 nodes and then scaled up to 64 nodes, requiring only 2.2GB per 
node across 32 nodes, and just 1.2GB per node across 64 nodes. In terms of performance, the timings 
and scalability are extremely good. The calculation took 108 s across 32 nodes and 55 s across 64 
nodes, a speedup of almost 2. To put this in perspective, for the OpenMP-based benchmark, the 
same system required 2576 s using 1 thread and 156 s using 20 threads. In terms of these results, 
there are several noteworthy points. Firstly, the total time is quoted as the sum of the computation 
and communication times. The total communication time includes the time for the forward trans- 
positions of both the six independent tensor components and three zero-padded vector components, 
and also includes the time for the reverse transposition of the resultant of the FD element-wise mul- 
tiplication of the components. These transpositions are essentially block events that are controlled 
by quite complex custom-built transposition algorithms. Even with the prescribed optimisations, 
due to the enormity of the tensor and vector component arrays, 'in-place' transposition is an ab- 
solute necessity. This complication is compounded by the custom-built parallel-memory allocation 
and 'cyclic-plane' domain decomposition schemes that provide a decrease in memory per node that 
is directly proportional to the number of active nodes and also provide inherent load balancing. In 
order to facilitate efficient 'in-place' transposition of the component arrays, the transposition algo- 
rithms employ the pre-computation of local 'in-place' transposes in order to simplify and streamline 
the host algorithm and subsequent data communication. Unfortunately, at present this negates the 
possibility of overlapping some of the communication with the computation. Secondly, although 
all communication timings arc marginally sub-linear, all computation timings exhibit super-linear 
speedups. Upon consulting with the ICHEC^ technical systems team, the peculiarity is believed 



Irish Centre for High-End Computing, |http:/ /www.ichec.ie| 
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to be an artifact of the initial benchmark. Due to the novel parallel memory allocation algorithms 
in the distributed-memory implementation of the OpenDDA framework, the memory footprint is 
decomposed across the active nodes. However, it is believed that the chosen system size and asso- 
ciated memory requirements were fractionally too large, causing slightly degraded performance due 
to swapping. Since all subsequent speedups are calculated with respect to this initial benchmark, 
and since the parallel memory allocation and domain decomposition schemes would have solved this 
allocation/swapping issue as the number of active nodes increased above 32, the subsequent timings 
would appear super-linear. This is not considered to be a cause for concern as this only introduces 
a global shift and it is to overall trend that is of importance here. 

To test the performance of the full OpenDDA framework, a benchmark was run on NEMO with 
K = J = P = 200, see Table. |3l This produced a computational domain with TV = 8 xlO^ sites in total 
and a representation of the spherical target with Nd~^, 188,896 occupied sites. The convergence 
tolerance was taken as the default value of 1 x 10^^°, the maximum number of iterations was set to 
1x10^, the initial guess was set to zero, and Point- Jacobi preconditioning was used. The performance 
and scalability of the full OpenDDA framework is extremely good. For this system, it is the simpler 
schemes that tend to provide the best performance, ho wever, since convergence tends to be slower 



for larger refractive index values (jDraine. B. T.1 . 119881 ). as the number of iterations increases, the 



enhanced stability of the more involved iterative methods may prove beneficial, and ultimately, may 
provide superior convergence. Although it is duly noted that due to the necessity of distributed 
transposes, the performance of the MPI version will always lag behind that of the OpenMP version, 
the slight deficit in performance is offset by the decrease in memory that is directly proportional to 
the number of nodes available. 

To put the new custom-built OpenDDA framework in context, several comparative benchmarks 
were performed between the various architecture specific implementa tions of OpenDDA, and ADDA; 



the fastest and most computationally efficient DDA implementation (jPenttila. A. et al.l . 120071 ). Since 



the number of iterations to reach convergence is dependent on the user-defined convergence tolerance 
and on the way in which the stopping criteria are defined within the specific implementation, the 
comparison focusses, not on the number of iterations, but on the time per iteration. For ADDA, 
the reason that only the total time for the initialisation of the tensor components is given is that 
ADDA performs the initialisation of the tensor components as a single conjoint operation, i.e., 
the computation of the components and the DFTs of these components are perform together. In 
terms of the quoted memory requirements, for the OpenDDA variants, the approximate memory 
requirements are obtained from the in-built estimation routines. To obtain an estimate for the ap- 
proximate memory requirements for ADDA, only 'large' arrays pertaining to the iterative solution 
have been included. Furthermore, some of the 'large' arrays have been excluded from considera- 
tion, as in ADDA, not all relevant 'large' arrays coexist, and thus, some do not contribute to the 
cumulative memory requirements. For the shared-memory comparison, for BICGsym, the mem- 
ory estimations for ADDA include, [xvec, rvec, pvec, Einc, Avecbuffer, Dmatrix, Xmatrix, slices & 
slices-tr], and for BiCGSTAB & QMR^yj^^, the estimation also includes [vecl, vec2 & vecS]. For the 
distributed-memory comparison, the estimations also include the contributions from ADDA's MPI 
communication buffers, [BT_buffer & BT_rbuffer], and to look at the efficiency of the decomposition 
of the associated data across the active nodes, apart from the total memory requirements, the results 
also give the maximum memory requirement per node. 

Table U] shows the results for the shared-memory comparative benchmark run on NEMO using 
BiCGSTAB, BICGsym & QMR for a spherical target with i^ = J = P = 200, iV = 8 x 10^, and 



Nd = 4, 188, 896. Please note that although ADDA docs employ Intel vectorisation (|Bik. A. J. G 



20041 ) within its basic linear algebra routines, it does incorporate generic OpenMP shared-memory 



parallelisation. Similarly, Table [6] shows the results for the distributed-memory comparative bench- 
mark run on WALTON using BiGGSTAB, BIGGsym & QMR^y^^, across 4, 8, 16, 32 & 64 nodes, 
where the relevant system sizes for each of the constituent benchmarks are quoted in Tabic O In 
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all instances, the new OpcnDDA performs extremely favourably. It is worth noting that the custom 
'cyclic-plane' decomposition within the OpenDDA framework allows for an extremely even distri- 
bution of the data across the active nodes. Even though, for the comparison across 32 nodes in 
Table El OpenDDA requires marginally more memory in total, the maximum memory required by 
any individual node is actually less. 

Finally, as a quick verification of the correct operation of the new OpenDDA framework, a 
comparison of the extinction, scattering & absorption efficiencies, Qext, Qsca & Qabs, for the well 
proven output of the widely used DDSCAT code and the three architectural variants of OpenDDA 
was performed on NEMO, see Table [71 All arc in good agreement. 

7 Conclusions 

Within the new framework, the bit-fielding of the storage associated with the spatial and compo- 
sitional specification of the target, not only provides significant reduction in the associated integer 
storage, but also facilitates the creation of a versatile and efficient distributed-memory implementa- 
tion. 

Although comparable codes, such as ADDA, have pioneered and employ similar optimisations, 
OpenDDA incorporates a custom-built highly-optimised DFT kernel that performs the requisite 
3D DFTs of both the tensor and vector, which are instrumental in the efRcient calculation of the 
matrix-vector products, as ensembles of ID transforms. By doing so, a significant fraction of the 
constituent ID transforms, which are superfluous, can be neglected. In comparison to the "out- 
of-the-box" algorithm based on 3D DFTs, OpenDDA requires approximately 60% less transforms 
and over 80% less memory in order to carry out the matrix-vector products. Furthermore, the 
optimisations permit parallel techniques to be employed to further enhance the performance. Both 
the resulting OpenMP-based shared-memory and MPI-based distributed-memory implementations 
possess extremely good scalability and associated performance. On top of these improvements, 
due to the fact that, computationally, the new framework only requires the creation of a limited 
number of ID DFT plans, rather than the full 3D DFT plans required by the original algorithm, the 
initialisation times, which can be significant, become completely negligible. The new framework also 
incorporates a relatively large selection of iterative methods that have been custom-built in serial, 
OpenMP and MPI to integrate seamlessly with the various architecturally specific DFT kernels. 

Within the distributed-memory implementation of the new framework, the novel parallel-memory 
allocation and domain decomposition schemes provide a decrease in memory per node that is directly 
proportional to the number of nodes and also provide inherent load balancing. Due to the fact that 
the new framework only stores information corresponding to occupied lattice sites, depending on 
the target being simulated, there can be quite an imbalance in the number of occupied sites per 
plane. However, the new framework employs custom-designed 'cyclic-plane' decomposition, and as 
a result, boasts an extremely well-balanced data distribution across the active nodes. 

The framework performs extremely favourably across all architectural variants, with the novel 
and unique OpenMP-based implementation providing unprecedented parallel performance on shared- 
memory architectures. 
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A Original DFT algorithm 

This appendix concisely summarises the conventional application of DFTs to the solution of complex- 
symmetric level-three tensor- Toeplitz m atrix- vector products in the context of the DDA, as pre- 
scribed bv lGoodman. J. J. et all il99l[) . 

The extension of the level-one formulation described at the beginning of to the treatment of 
level-three systems, '"'y = i^itJi[ffix[™ix, is obtained by performing the T-to-c conversion on each of 
the three levels of i^it/i[A']^ j^g^^ .^^g^g performed for the level-one system, resulting in a level-three 
circulant, yp'\y'\yx'\C, where K' = 2K- 1, J' = 2J - 1, P' = 2P- 1. 

i=0 
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Here, each of the '-''"'^''Cfc is a level-two circulant, each of the ^"^'^Ck.j is a level-one circulant, 
the symbol '(g)' deno tes the Kronecker tenso r product, and '"'')/' is the downward-shift permutation 
matrix of order [n] (jVan Loan. C. F.I . 119921 ). The arbitrary vector must also be zero-padded to 
negate the contribution for the circulant extensions on all three levels, '"'x 'x, where [N'] ~ 
\K' X J' X P']. The complimentary Fourier structure which diagonalises this level-three system is 
acquired from the hierarchical dia gonalisation by th e Fourier matrices, '^''F, '■'''F and '^''F, i.e.. 



[p'][j'i[A-']p ^ [p'\Y ^ IT']Y (g) i^'ip (jPavis. P. J.I . Il994l ). As a rcsuh, the level-three matrix-vector 



product, '"''y = [J''i[-''i[«'ic[«'ix, can be carried out as an clcmcnt-wisc FD multiplication, where '"''c 
is the first column of i^'iu'ii^'ic. 



pP'][j'][K']-p[N'l 



(5) 



Furthermore, using the realisation that = 3J3 DFT allows Eq. [Klto be reformulated 

in terms of the 3D DFT. To facilitate this, before the aforementioned T-to-c conversions and zero- 
padding takes place, the first column of each of the six independent tensor component matrices, 
'"'to, . . . , '"'t5, and the three vector components, '"'xO, '"'xl, '"'x2, must be arranged into separate 
3D structures, '"'tOsD, . . . , '"'t53D and '"'xOsd, '"'xIsd, '"'x23D, respectively. For example, the trans- 
formations '"'to '"'t03D and '"'xO '"'xOsd are governed by '"'tOao^j.fc = ^"hOkjK+jK+i and 
'"'xOsDij^fc = '"'xOfcjif+jK+i, i 0, . . . , - 1, j 0, . . . , J - 1, fc = 0, . . . , P - 1, respectively. 

Once in 3D form, to facilitate the DFT applicability, each of the six tensor-component arrays 
must undergo T-to-c conversion, e.g., '"'tOsD '" 'cOsd, and the three vector-component arrays 
must be zero-padded, e.g., '"'xOsd — > '"''xOsd. For a level-three system, the T-to-c conversion 
involves taking each tensor-component array in turn, denoted by section '1' in Fig. [HI treating each 
line in the x, y & z directions as the first column of a level-one Toeplitz matrix, and mirroring 
it, excluding the initial entry, across its last entry. The mirrored elements must also by multiplied 
by the appropriate entry from Table [TJ The reason for this is that for a general T-to-c conversion, 
one would usually append the first row, excluding the initial entry, in reverse order. However, since 
the nine tensor components are from the dyadic product in Eq. [21 the Toeplitz matrix is either 
symmetric or non-symmetric by sign only. Hence, it is possible to just mirror the column itself 
and, if necessary, multiply the mirrored entries by ' — 1'. The x conversion creates section '2', the y 
conversion creates section '3', and the z conversion creates section '4' in Fig. El The zero-padding 
of the three vector-component arrays is much simpler. Each component is trivially embedded in the 
[0 — > /ST — 1] X [0 ^ J — 1] X [0 — *■ P — 1] octant of a, K' x J' x P' array of zeroes, i.e., it would 
occupy section '1' in Fig. [51 and the rest of the array would be identically zero. 

The algorithm requires the DFT of the six tensor-component arrays, '"''cOsd, . . . , '"''c53D, and the 
three vector-component arrays, '"''xOsd, '™''xl3D, '"''x23D, yielding Y1,Y2,...,Y5 and X0,X1,X2, 



respe ctively. Note that for the practical application of the theo ry, as with DDSCAT (jDraine. B. T. and Flatau. P. J 



2004 ) and ADDA ijYurkin. M. A. and Hoekstra. A. G.l . l2008f) . for maximum performance, if neces- 



sary, the transformations described above are all zero-padded to ensure tha t the final dimensions of 



the compon ent arrays permit the use of fast DFT algorithms, i.e., for FFTW (jFrigo. M. and Johnson. S. G 



1998ll2005l ). a dimension which satisfies 2''3^5'^7''ll'^13'^, where a, b, c, d are arbitrary and {e + f < 1). 
These are illustrated in Fig. \E\ as kf, jf, and pf, and increase the dimensions to K" = K' + kf = 
2K -1 + kf, J" = J' +jf ^2J-1+ jf, and P" = P' + pf = 2P - 1 + pf. Subsequently, the 
contribution from each of the individual components, to the full circulant matrix- vector product, is 
obtained via the element-wise tensor- vector product of their transforms. 



YO,,,-fcXO,,j-fe 
Yli j-,fcXOi j-^fc 
Y2i j^feXOi j-^fe 



" Yli j_fcXli j-^fc + Y2i j-,feX2j; 

■ Y3i j-^fcXli j-^fc + Y4i j-^feX2i j_/c 

■ Y4i j-^fcXl,; -|- Y5i j-^feX2i 



(6) 
(7) 
(8) 
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i = 0, ...,A"-l,j = 0, ...,J'-l,fc = 0, ...,P'-1 

The inverse DFT (iDFT) of QO, Ql & Q2 provides yOsD, ylsD & y23D, which are the resuhant of 
Eq. [SI in 3D form, and the x,y, z components of the original tensor- Toephtz matrix-vector product, 
[«]y ^ [p][.7][A']rp[jvi^^ obtained by extracting the [0 isT - 1] x [0 ^ J - 1] x [0 ^ P - 1] octant 
of yOsD, yl3D & y23D, respectively, i.e., '1' in Fig. This reduces the complexity from 0{[iNd\'^) 
operations and storage to 0(12A^' In A^') and 0{%N'), respectively. 
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Table 1: Signs to account for the non-symmetries in sign only for the T-to-c conversions of the six 
independent tensor components from the dyadic product in Eq. O 





XX 


xy 


xz 


yy 


yz 


zz 


X 


DFT 


+ 1 


-1 


-1 


+1 


+1 


+1 


y 


DFT 


+ 1 


-1 


+1 


+1 


-1 


+1 


z 


DFT 


+ 1 


+ 1 


-1 


+1 


-1 


+1 



Table 2: Complexity estimations for the original and new algorithms. 



COMPUTATION 


Original Algoritlim 


New Algorithm 


Y0^Y5:x 
Y0-»Y5:y 
Y0-^Y5:z 


0{6N" In A'") 
0{6N" In J") 
0(6 AT" InP") 


0{6JPK" \nK") 
0{6[K + kf] PJ" In J") 
0{6[K + kf] [J + jf] P" In P") 


X0^X2:x 
X0^X2:y 
X0^X2:z 


0{3N" In A'") 
0(37V" In J") 
OiSN" InP") 


0{3J"PK" \nK") 
0{3KPJ" In J") 
0{3N" InP") 


Q0^Q2:x 
Q0^Q2:y 
Q0^Q2:z 


0{'3N" In if") 
0{3N" In J") 
0(37V" InP") 


0{3J"PK" In if") 
0{3KPJ" In J") 
0{3N" InP") 








STORAGE 


Original Algorithm 


New Algorithm 


Y0-^Y5 
X0^X2 


0{6N") 
0{3N") 


0{6[{K + kf) X {J + jf) X (P + pf)]) 
0(3 [KJ"P]) 
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Tabic 3: Performance of the full serial, OpenMP & MPI implementations of the new framework on 
NEMO using 4 processors for the parallel versions. SV: Number of starting vectors, NI: Number of 
iterations, CT: Convergence time, SU: Speedup, Mem: Memory required in total. Node: Memory 
required per node. 





SERIAL 


OpenMP 


MPI 


Scheme 


SV 


NI 


CT 


Mem 


CT 


SU 


Mem 


CT 


SU 


Mem 


Node 








(s) 


(GB) 


(s) 






(GB) 


(s) 






(GB) 


(GB) 


DlCg 
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1439.7 


3 


1 


404.3 


3.56 


3 


2 


604.4 


2 


4 


3 


9 


1.0 






21 


741.5 


2 


6 


212.7 


3 


5 


2 


6 


306.2 


2 


4 


3 


3 


0.8 


bicgstab 




12 


846.6 


3 


1 


234.7 


3 


6 


3 


2 


353.8 


2 


4 


3 


9 


1.0 


eg 




21 


746.6 


2 


6 


209.7 


3 


6 


2 


6 


320.9 


2 


3 


3 


3 


0.8 


cgs 




13 


911.3 


3 


1 


257.4 


3 


5 


3 


2 


382.9 


2 


4 


3 


9 


1.0 


mlbicgstab 


1 


12 


836.0 


3 


3 


238.9 


3 


5 


3 


3 


341.5 


2 


5 


4 





1.0 


mlbicgstab 


2 


8 


811.7 


3 


9 


246.2 


3 


3 


3 


9 


347.1 


2 


3 


4 


6 


1.2 


mlbicgstab 


3 


6 


842.9 


4 


5 


266.2 


3 


2 


4 


5 


369.6 


2 


3 


5 


2 


1.3 


qmr 
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3704.6 


3 


5 


1055.1 


3 


5 


3 


5 


1636.4 


2 


3 


4 


3 


1.1 


qmrayni 
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759.6 


3 


1 


216.7 


3 


5 


3 


2 


320.7 


2 


4 


3 


9 


1.0 


rbicgstab 


1 


14 


996.3 


3 


1 


284.5 


3 


5 


3 


2 


456.0 


2 


2 


3 


9 


1.0 


rbicgstab 


2 


7 


990.5 


3 


5 


285.2 


3 


5 


3 


5 


415.5 


2 


4 


4 


3 


1.1 


rbicgstab 


3 


4 


851.1 


3 


9 


242.2 


3 


5 


3 


9 


357.7 


2 


4 


4 


6 


1.2 


tfqmr 




13 


1850.3 


3 


5 


524.7 


3 


5 


3 


5 


795.1 


2 


3 


4 


3 


1.1 





cpus 


Iteration 


Product 


Tensor 


components 


Memory 










Creation 


DFT 


Total 






BiCGSTAB 






(s) 


(s) 


(s) 


(s) 


(s) 


(GB) 


ADDA 


1 


77.3 


37.6 






65.5 


3.7 


openDDA_S 


1 


68.3 


32.7 


1.8 


11.6 


13.4 


3.1 


openDDA_OMP 


4 


19.8 


9.2 


0.5 


3.3 


3.8 


3.2 





BICGsym 






(s) 


(s) 


(s) 




(s) 


(GB) 




ADDA 




1 


40.2 


37.6 






65.3 


3.1 


openDDA_S 


1 


33.9 


32.4 


1.8 


11.6 


13.3 


2.6 


openDDA_OMP 


4 


10.5 


9.6 


0.5 


3.3 


3.8 


2.6 




















(s) 


(s) 


(s) 


(s) 


(GB) 




ADDA 




1 


40.7 


38.4 






65.5 


3.7 


openDDA_S 


1 


35.1 


33.1 


1.8 


11.6 


13.4 


3.1 


openDDA_OMP 


4 


10.2 


9.2 


0.5 


3.3 


3.8 


3.2 



Table 4: Comparative benchmark run on NEMO between ADDA 0.76 and the double-precision serial 
and OpcnMP-based shared-memory variants of OpenDDA using BiCGSTAB, BICGsym & QMRgyj^j, 
for K = J = P ^200, which produced a computational domain with iV = 8 x 10^ sites in total and a 
representation of the spherical target with Nd~4:, 188, 896 occupied sites. 
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SYSTEM SIZES FOR THE COMPARISON 



Nodes 


K=J=P 


N 


Nd 


4 


240 


13,824,000 


7,238,592 


8 


310 


29,791,000 


15,599,512 


16 


380 


54,872,000 


28,732,192 


32 


480 


110,592,000 


57,904,928 


64 


600 


216,000,000 


113,098,912 



Table 5: System size, associated size of the computational domain and the number of occupied sites 
in the computational domain for the MPI-based comparative benchmark with ADDA across on 4, 
8, 16, 32 & 64 nodes on WALTON. 





kf K 




Figure 1: Outlines the optimisation of the requisite 3D DFTs via the implicit T-to-c conversion 
and simultaneous DPT of the tensor-component arrays and, for the vector components, by only 
performing transforms that are non-zero or that will contribute to the final answer (a) Tensor 
components (x direction) (b) Tensor components (y direction) (c) Tensor components (z direction) 
(d) Vector components (y direction) (e) Vector components (x direction) (f) Vector components (z 
direction). 
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cpus 


Iteration 


Product 


Tensor components 
Creation DFT Total 


Memory 
Total Per node 


BiCGSTAB 




(s) 


(s) 


(s) (s) (s) 


(GB) (GB) 


ADDA 
openDDA_MPI 


4 
4 


64.8 
54.5 


31.2 
25.9 


41.7 

17.4 0.9 18.2 


7.6 2.2 

6.7 1.7 


ADDA 
openDDA_MPI 


8 
8 


76.8 
63.9 


36.8 
30.6 


49.9 

24.4 0.9 25.3 


15.4 2.3 
14.7 1.8 


ADDA 
openDDA_MPI 


16 
16 


79.9 
66.1 


36.8 
31.0 


45.2 

19.4 0.8 20.3 


27.6 2.1 
27.5 1.7 


ADDA 
openDDA_MPI 


32 
32 


81.5 
71.6 


38.2 
34.1 


44.6 

23.3 0.8 24.1 


54.3 2.0 

54.4 1.7 


ADDA 
openDDA_MPI 


64 
64 


98.1 
92.2 


46.9 
43.7 


62.9 

39.4 0.8 40.2 


116.7 2.2 
108.9 1.7 






















(s) (s) (s) 


(GB) (GB) 


ADDA 
openDDA_]VIPI 


4 
4 


33.1 
27.3 


30.6 
25.9 


41.2 

17.3 0.9 18.2 


6.7 1.8 
5.7 1.4 


ADDA 
openDDA_MPI 


8 
8 


37.6 
32.3 


35.5 
30.5 


49.2 

24.2 0.9 25.1 


13.3 1.9 
12.6 1.6 


ADDA 
openDDA_MPI 


16 
16 


39.9 
30.1 


38.2 
30.4 


45.8 

19.9 0.8 20.8 


23.8 1.7 
23.6 1.5 


ADDA 
openDDA_]VIPI 


32 
32 


40.6 
35.8 


37.9 
34.1 


45.7 

23.0 0.8 23.8 


45.6 1.7 
46.6 1.5 


ADDA 
openDDA_MPI 


64 
64 


50.5 
45.8 


47.1 
43.2 


61.9 

38.4 0.8 39.2 


101.5 1.8 
93.7 1.5 






















(s) (s) (s) 


(GB) (GB) 


ADDA 
openDDA_]VIPI 


4 
4 


33.2 
27.7 


32.5 
26.0 


41.8 

17.6 0.9 18.4 


7.6 2.2 

6.7 1.7 


ADDA 
openDDA_MPI 


8 
8 


39.2 
32.8 


35.9 
30.2 


48.1 

24.5 0.9 25.3 


15.4 2.3 
14.7 1.8 


ADDA 
openDDA_MPI 


16 
16 


41.0 
33.3 


37.4 
31.2 


46.3 

20.1 0.8 20.9 


27.6 2.1 
27.5 1.7 


ADDA 
openDDA_MPI 


32 
32 


40.3 
37.6 


38.1 
35.3 


46.5 

22.7 0.8 23.5 


54.3 2.0 

54.4 1.7 


ADDA 
openDDA_MPI 


64 
64 


50.8 
46.9 


47.5 
43.8 


61.4 

38.2 0.8 39.0 


116.7 2.2 
108.9 1.7 



Table 6: Comparative benchmark run on WALTON between ADDA 0.76 and the double-precision 
MPI-bascd distributed-memory variant of OpenDDA using BiCGSTAB, BICGsym & QMR^yj^j, 
across 4, 8, 16, 32 & 64 nodes. 
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DDSCAT 


OpenDDA (Serial) 


OpenDDA 


(OpenMP[4]) 


OpenDDA (MPI[4]) 


Q,^t = 1.264 


Q^^t = 1.265 


Qext 


= 1.265 


Qe:it=1.265 


Qabs=0.911 




Qabs 


=0.911 


Qaba=0.911 


Q^^„ =0.353 


Q^^„=0.3.54 


Qsca 


=0.354 


Q,^„=0.354 



Table 7: Results for the operational verification of the three architectural variants of the new 
OpenDDA framework. Parameters: CUBE, K = J= P= 100 ^N^Nd^lx 10^ normal 
incidence on one face of the cube, incident wavelength A = 3.175 /iin, incident polarisation Co = x, 
effective radius a = 0.5 /xm, and complex refractive index m = 1.63631 + 0.372i. 




Figure 2: Load balance comparison for block and 'cyclic-plane' decomposition for a simplistic spher- 
ical target based on an 18 x 18 x 18 computational grid. 
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0.0 5.0e+5 1.0e+6 1.5e+6 2.0e+6 0.0 6.0e+7 1.0e+8 1.5e+8 2.0e+8 

Tensor Matrix Order N Tensor Matrix Order N 




0.0 5.0e+7 1.0e+8 1.5e+8 2.0e+8 1e+2 1e+3 1e+4 1e+5 1e+6 1e+7 1e+8 1e+9 

Tensor Matrix Order N Number of dipoles 



Figure 3: Time and memory required to compute the matrix- vector product on FRANKLIN. N.B. 
The label 'Tensor Matrix Order N' refers to a matrix of size N, where the elements are tensors, i.e., the 
matrix is actually 3N x 3A*'. (a) Short-range timings; (b) Long-range timings; (c) FFTW DFT plan creation 
timings; (d) Approximate memory requirements. 3D (Orig): Original 3D DFT algorithm for all DFTs & 
iDFTs. ID (Orig): Original algorithm using ID DFTS. ID (T New): Hybrid system, new algorithm for 
the tensor DFTs and the vector DFTs & iDFTs are as in ID (Orig). ID (T & V New): Hybrid system, 
new algorithm for aU tensor & vector DFTS and only the iDFTs are as in ID (Orig). ID (T, V & I New): 
Complete new algorithm for aU DFTs & iDFTs. ID OMP # (T, V & I New): New OpenMP algorithm 
using '#' threads. ID MPI (T, V & I New) # nodes: New MPI algorithm using '#' nodes. Intel 
MKL CBLAS zgemv: Explicit calculation using the Intel MKL (|lntel Corporation. . ,2007. '). 
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I^H Data compression 

Normalisation & diagonal contribution 
DFT of the 3 vector components, FD multiplication 
and iDFT of the resuitant 
rrn DFT of the 6 independent tensor components 
Initialisation of the 3 vector components 
(Embedment + zero-padding) 



llliiii 



- Linear 

- Initialisation of the 3 vector components 
(Embedment + zero-padding) 

- DFT of the 6 independent tensor components 
DF T of the 3 vector compone nts, FD multiplication > 
and iDFT of the resultant 

- Normalisation & diagonal contribution 

- Data compression 

- Total 



0123456789 10 11 12 13 14 15 16 17 18 19 20 21 
Number of processors 



123456789 10 11 12 13 14 15 16 17 18 19 20 21 
Number of processors 



Figure 4: OpcnMP scalability and timing run on FRANKLIN for a spherical target with K ^ 
and A^^ 113, 089, 912 occupied sites. 




32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 



Number of nodes Number of nodes 



Total (Computation + communication) • DFT of the 3 vector components (Communication) 

Initiaiisation of the 3 vector components (Embedment + zero-padding) • IDFT of the 3 vector components (Communication) 

DFT of the 6 independent tensor components (Computation only) r Normalisation, diagonal contribution & data compression 

DFT of the 6 independent tensor components (Communication) ■ Total communication 

DFT of the 3 vector components, FD multiplication & IDFT of the * Total computation 

resultant (Computation only) Linear 

Note: Communication includes locai transposes + actuai communication 




Figure 5: MPI scalability and timing run on WALTON with associated speedups for a spherical 
target with K = J = P = 600 and A^rf = 113, 089, 912 occupied sites. 
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